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Abstract: We address the sequential change-point detection problem for the Gaussian model where baseline distri- 
bution is Gaussian with variance and mean [i such that = a/i, where a > is a known constant; the change 
is in /i from one known value to another First, we carry out a comparative performance analysis of four detection 
procedures; the CUSUM procedure, the Shiryaev-Roberts (SR) procedure, and two its modifications - the Shiryaev- 
Roberts-Pollak and Shiryaev-Roberts-r procedures. The performance is benchmarked via Pollak's maximal average 
delay to detection and Shiryaev's stationary average delay to detection, each subject to a fixed average run length 
to false alarm. The analysis shows that in practically interesting cases the accuracy of asymptotic approximations is 
"reasonable" to "excellenf '. We also consider an application of change-point detection to cybersecurity - for rapid 
anomaly detection in computer networks. Using real network data we show that statistically traffic's intensity can be 
well-described by the proposed Gaussian model with — afi instead of the traditional Poisson model, which requires 
(T^ = /i. By successively devising the SR and CUSUM procedures to "catch" a low-contrast network anomaly (caused 
by an ICMP reflector attack), we then show that the SR rule is quicker We conclude that the SR procedure is a better 
cyber "watch dog" than the popular CUSUM procedure. 

Keywords: Anomaly detection; Cybersecurity; CUSUM procedure; Intrusion detection; Sequential analysis; Sequen- 
tial change-point detection; Shiryaev-Roberts procedure; Shiryaev-Roberts-Pollak procedure; Shiryaev-Roberts-r 
procedure. 

Subject Classifications: 62L10; 62L15; 62P30. 
1. INTRODUCTION 

Sequential change-point detection is concerned with the design and analysis of techniques for fastest (on- 
line) detection of a change in the state of a process, subject to a tolerable limit on the risk of committing a 
false detection. The basic iid version of the problem considers a series of independent random observations, 
Xi,X2, . . ., which initially follow a common, known pdf f{x), but subsequent to some unknown time index 
v all adhere to a common pdf g{x) ^ f{x), also known; the unknown time index, u, is referred to as the 
change-point, or the point of "disorder". The objective is to decide after each new observation whether the 
observations' common pdf is currently f{x), and either continue taking more data, if the decision is positive, 
or stop and trigger an "alarm", if the decision is negative. At each stage, the decision is made based solely on 
the data observed up to that point. The problem is that the change is to be detected with as few observations 
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as possible past the true change-point, which must be balanced against the risk of sounding a "false alaim" - 
stopping prematurely as a result of an erroneously made conclusion that the change did occur, while, in fact, 
it never did. That is, the essence of the problem is to reach a tradeoff between the loss associated with the 
detection delay and the loss associated with false alarms. A good sequential detection procedure is expected 
to minimize the average detection delay, subject to a constraint on the false alarm risk. 
The particular change-point scenario we study in this work assumes that 

f[x) = — exp < > and qix) = , exp < — tt— f , (1-1) 

where 0<a<oo, O</i7^0<oo, and — oo < x < oo. We will refer to this model as the A/'(/z, a/i)-to- 
A/'(^, aO) model. This model can be motivated by several interesting application areas. 

First, it arises in the area of Statistical Process Control (SPC) in a sequential estimation context. Sup- 
pose a process working in the background on day i ^ 1 produces iid data {Xj with unknown mean, 
^, and unknown variance, cr^. Per an internal SPC protocol one would like to estimate /i by construct- 
ing a confidence interval for it of prescribed width 2d, d > 0, and prescribed confidence level 1 — a, 
< a < 1. If cj^ were known, then the required asymptotically (as d — )• 0) optimal (fixed) sample 
size would be N*^ = \z^^i^a^ j where \x\ is the smallest integer not less than x. Specifically, we 
would have \mi^_^i^\S'N'^ J z^^i^a'^^ = 1, and therefore, limrf_j.o Pd^^r* ^ — ^ d) = 1 — a, where 
Xn = Yl'i=i-^i "^he sample mean. However, since o"^ is unknown. Chow and Robbins (1965) 
proposed the following purely sequential strategy. Let 

Na,d{i) = inf |n ^ mi{^ 2) : n ^ ^a/2-^| ' 

where {Sfj^}n^2 is the sample variance computed from {Xi^n}n^i on day i ^ 1, and rrii is the pilot sample 
size on day i ^ 1. At the end of the i-th day, we get to observe N* ^{i), and thus obtain an estimate of A'^^ ^. 
Note that ^^^^^(i), i ^ 1, is finite w.p. 1. Furthermore, under the sole assumption that < < oo. Chow 
and Robbins (1965) show that Na,d{i) is asymptotically (as d — )• 0) efficient and consistent. The same result 
when the data are Gaussian was also established by Anscombe (1952, 1953). Now, by Mukhopadhyay and 
Solanky (1994, Theorem 2.4.1, p. 41), first established by Ghosh and Mukhopadhyay (1975), as d 0, 
the distribution of Na,d{i) converges to Gaussian with mean N*^ and variance aN*^, where a > is 
foundable explicitly; for instance, a = 2 when the data are Gaussian. One may additionally refer to Ghosh 
et al. (1997, Exercise 2.7.4, p. 66) and Mukhopadhyay and de Silva (2009). 

We note that when sampling in multi-steps or under some sort of acceleration parameter < p < 1, this 
"a" is a function of p, and captures how much sampling is done purely sequentially before it is augmented 
by batch sampling. See, e.g., Ghosh et al. (1997, Chapter 6). 

The main concern is cr^, that is, by trying to hold N* ^ within reason, one may want to detect changes 
in A'^^ ^ over days. Put otherwise, what if we suddenly see a trend of "larger or smaller than usual" values 
of Na,d{i), i ^ 1, i.e., estimates of N* ^? As d — 0, this becomes precisely model (1.1). This discussion 
remains unchanged even when the mean p = pi, i ^ 1, is not the same over days. 

Second, model (1.1) can find application in telecommunications. Let Xn and y„ be the number of calls 
made to location 1 and 2, respectively, on day n ^ 1. Assume that {Xn}n^i and {l^njnSsi are each iid 
Poisson with rate A > 0. At location 1, the number of calls, Xn is recorded internally during the whole 
day's operation between 8:00 AM to 6:00 PM. At location 2, however, the number of calls can be monitored 
only during half-the-day, between 8:00 AM to 1:00 PM, e.g., for lack of appropriate fund or staff. How 
should one model the aggregate number of calls, i.e., the number of calls from lines 1 and 2 combined? 
Since Xn + Yn cannot be observed due to lack of PM staffing, one may instead consider [/„ = Xn + Yn/2. 



2 



This is reasonable provided the calls are divided approximately equally between the morning and afternoon 
shifts. For moderately large A, however, Un will be approximately Gaussian with mean 3A/2 and variance 
5 A/4. Thus, at the end of day i ^ 1 one respectively has independent observations Ui,U2, ■ ■ ■ recorded 
from a J\f{fJ,, afi) distribution with = 3A/2, a = 5/6. 

Third, the possibility of approximating the Poisson distribution with Gaussian makes model (1.1) of use 
in the area of cybersecurity, specifically in the field of rapid volume-type anomaly detection in computer net- 
works traffic. A volume-type traffic anomaly can be caused by many reasons, e.g., by an attack (intrusion), a 
virus, a flash crowd, a power outage, or misconfigured network equipment. In all cases it is associated with 
a sudden change (usually, from lower to higher) in the traffic's volume characteristics; the latter may be de- 
fined as, e.g., the traffic's intensity measured via the packet rate - the number of network packets transmitted 
through the link per time unit. This number randomly varies with time, and according to latest findings, its 
behavior can be well described by a Poisson process; see, e.g., Cao et al. (2002), Kai^agiannis et al. (2004), 
and Vishwanathy et al. (2009). In this case, the average packet rate is nothing but the amval rate of the Pois- 
son process, which undergoes the upsurge triggered by an anomaly. Whether prior to or during an anomaly, 
the average packet rate typically measures in the thousands at the lowest, and therefore the parameter of the 
Poisson distribution is usually rather large. Hence, one can accurately approximate the Poisson distribution 
with a Gaussian one whose mean and variance are both equal to the average packet rate. This is precisely 
the AA(/i, a fi) -to- J\f {9, aO) model with a = 1. However, in general, behavior of real traffic may deviate from 
the Poisson model, and parameter a in the AA(/i, afi)-to-M{9, aO) model can be used as an extra "degree of 
freedom" allowing one to take these deviations into account. Thus, the A/'(/i, afi)-to-M{9, aO) model is a 
more general and better option. 

The main objective of this work is to carry out a peer-to-peer comparative multiple-measure performance 
analysis of four detection procedures: the CUSUM procedure, the Shiryaev-Roberts (SR) procedure, and 
two of its derivatives - the Shiryaev-Roberts-Pollak (SRP) procedure and the Shiryaev-Roberts-r (SR-r) 
procedure. The performance measures we ai^e interested in are Pollak's (1985) Supremum Average Detection 
Delay (ADD) and Shiryaev's (1963) Stationary ADD - each subject to a tolerable lower bound, 7, on the 
Average Run Length (ARL) to false alarm. Our intent is to test the asymptotic (as 7 — )• 00) approximations 
for each performance index and each procedure of interest. 

The remainder of the paper is organized as follows. In Section 2 we formally state the problem, introduce 
the performance measures and the four procedures; for those with exact optimality properties we also remind 
these properties. In Section 3 we discuss the asymptotic properties of the detection procedures and present 
the corresponding asymptotic performance approximations. Section 4 describes the methodology of eval- 
uating operating characteristics. Section 5 is devoted to the AA(/x, a jj,) -to- J\f {9, aO) change-point scenario. 
Specifically, using the methodology of Section 4, we study the performance of each of the four procedures 
and examine the accuracy of the corresponding asymptotic approximations. Section 6 is intended to illus- 
trate how change-point detection in general and the A/'(/Li, a fi) -to- J\f {9, a9) model in particular can be used 
in cybersecurity for rapid anomaly detection in computer networks. Lastly, Section 7 draws conclusions. 

2. OPTIMALITY CRITERIA AND DETECTION PROCEDURES 

As indicated in the introduction, the focus of this work is on two formulations of the change-point detection 
problem - the minimax formulation and that related to multi-cyclic disorder detection in a stationary regime. 
The two, though both stem from the same assumption that the change-point is an unknown (non-random) 
number, posit their own optimality criterion. 

Suppose there is an "experimenter" who is able to gather, one by one, a series of independent random 
observations, {Xn]n^i- Statistically, the series is such that, for some time index v, which is refeiTcd to 
as the change-point, Xi, X2, . . . , Xy each posses a known pdf f{x), and X^+i, Xyj^2, • • • are each drawn 
from a population with a pdf g{x) ^ f{x), also known. The change-point is the unknown serial number of 
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the last /( 2; ) -distributed observation; therefore, if 1/ = 00, then the entire series is sampled from 

distribution with the pdf /(x), and if v = 0, then all observations are ^'(x) -distributed. The experimenter's 
objective is to decide that the change is in effect, raise an alarm and respond. The challenge is to make 
this decision "as soon as possible" past and "no eaiiier" than a certain prescribed limit prior to the true 
change-point. 

Statistically, the problem is to sequentially differentiate between the hypotheses Hk- = k ^ 0, i.e., 
that the change occurs at time moment v = k, ^ k < 00, and : v = 00, i.e., that the change never 
strikes. Once the A;-th observation is made, the experimenter's decision options are either to accept T-Lk, and 
thus declai^e that the change has occurred, or to reject Hk and continue observing data. 

To test T-Lk against T-L^o, one first constructs the corresponding likelihood ratio (LR). Let Xi-n = 
(Xi, X2, . . . , Xn) be the vector of the first n ^ 1 observations; then the joint pdf-s of Xi-n under Hk 
and are 

n f ^ \ 

p{Xi.,n\n^) = l[f{Xj) and p{Xi.,n\nk) = llfix,) > 
j=l \j=l J 

where hereafter it is understood that nj=fc+i = fif k ^ n; i.e., p{Xi:n\'H^) = p{Xi:n\'Hk) if k ^ n. For 
the LR, Ak:n = p{Xi.n\'Hk)/p{Xi.n\'H^), one then obtains 

Ak:n= n ^^^""^ ^" = 77$4' (2.1) 

j=k+l -^^^"^ 

and we note that A^.^ = 1 if fc ^ n. We also assume that Aq = 1. 

Next, to decide which of the hypotheses T-Lk or T-L^ is true, the sequence {Afc:,j}i^fc^„ is turned into 
a detection statistic. To this end, one can either go with the maximum likelihood principle and use the 
detection statistic Wn = maxi^fc^^ A^.^ (maximal LR) or with the generalized Bayesian approach (limit of 
the Bayesian approach) and use the quasi-Bayesian detection statistic i?„ = Ylk=i ^k-.n (average LR with 
respect to an improper uniform prior distribution of the change-point). See Polunchenko and Tartakovsky 
(2011) for an overview of change-point detection approaches. 

Once the detection statistic is chosen, it is supplied to an appropriate sequential detection procedure. 
Given the series {X„}„^i, a detection procedure is defined as a stopping time, T, adapted to the filtration 
{-^n}n5:0> where Fn = (T{Xi-n) is the sigma-algebra generated by the observations collected up to time 
instant n ^ 1. The meaning of T is that after observing Xi, X2, ■ ■ ■ , Xt it is declai^ed that the change is in 
effect. That may or may not be the case. If it is not, then T ^ u, and it is said that a false alarm has been 
sounded. 

The above two detection statistics - {VFnjn^i and {Rn}n^i - give raise to a myriad of detection pro- 
cedures. The first one we will be interested in is the Cumulative SUM (CUSUM) procedure. This is a 
maximum LR-based "inspection scheme" proposed by Page (1954) to help solve issues arising in the area 
of industrial quality control. CuiTcntly, the CUSUM procedure is the de facto "workhorse" in a number of 
branches of engineering. Formally, we define the CUSUM procedure as the stopping time 

CA = mf{n^l:Wn> A}, (2.2) 

where A > is the detection threshold, and Wn = maxi^fc^„ Ak-n is the CUSUM statistic, which can also 
be defined recursively as 

Wn = max{l, Wn-i} An, n ^ 1 with Wq = 1, (2.3) 
Hereafter in the definitions of stopping times inf{0} = 00. 
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Another detection procedure we will consider is the the Shiryaev-Roberts (SR) procedure. In contrast 
to the CUSUM procedure, the SR procedure is based on the quasi-Bayesian argument. It is due to the 
independent work of Shiryaev (1961, 1963) and Roberts (1966). Specifically, Shiryaev introduced this 
procedure in continuous time in the context of detecting a change in the drift of a Brownian motion, and 
Roberts addressed the discrete-time problem of detecting a shift in the mean of a sequence of independent 
Gaussian random variables. Formally, the SR procedure is defined as the stopping time 

SA = mf{n^l: Rn^ A}, (2.4) 

where yl > is the detection threshold, and ii„ = Ylk=i ^k-.n is the SR statistic, which can also be 
computed recursively as 

Rn = (1 + Rn-i) K, n^l with Ro = 0, (2.5) 

Note that the SR statistic starts from zero. 

We will also be interested in two derivatives of the SR procedure: the Shiryaev-Roberts-Pollak (SRP) 
procedure and the Shiryaev-Roberts-r (SR-r) procedure. The former is due to PoUak (1985) whose idea 
was to start the SR detection statistic, {Rn}n^o> not from zero, but from a random point Rq = R^, where 
Rq is sampled from the quasi-stationary distribution Qa{x) = lim„_>.oo Poo (-Rn ^ x\Sa > n) of the SR 
detection statistic under the hypothesis H^. (Note that {i?n}n>o is a Markov Harris-recurrent process under 
Tiao-) Formally, the SRP procedure is defined as the stopping time 

S'^ = mf{n^l: R^^ A}, (2.6) 
where ^ > is the detection threshold, and Rn is the SRP detection statistic given by the recursion 

rQ = {1 + A„, n^l with R^ oc Qa{x). (2.7) 

The SR-r procedure was proposed by Moustakides et al. (2011) who regard starting off the original SR 
detection statistic, {Rn}n^o, from a fixed (but specially designed) Rq = r, ^ r < A, and defining the 
stopping time with this new deterministic initialization as 

5^ = inf{n ^ 1: i?; ^ A}, (2.8) 

where A > Ois the detection threshold and the SR-r detection statistic R^ is given by the recursion 

i?; = (1 + An, n ^ 1 with R'Q = r^O, (2.9) 

Note that for r = this is nothing but the conventional SR procedure. 

We now proceed with reviewing the optimality criteria whereby one decides which procedure to use. 

We first set down some additional notation. Let Pty(-) be the probability measure generated by the obser- 
vations {Xn}n^i when the change-point is u ^ 0, and be the corresponding expectation. Likewise, 
let Poo(') and ] denote the same under the no-change scenario, i.e., when u = oo. 

Consider first the minimax formulation proposed by Lorden (1971) where the risk of raising a false 
alarm is measured by the ARL to false alarm ARL(T) = E^[T] and the delay to detection by the "worst- 
worst-case" ADD J-l{T) = supo<;^<oo{esssupE;^[(T — z^)+|7",^]}, where hereafter = max{0,x}. Let 
A(7) = {T: ARL(T) ^ 7} be the class of detection procedures (stopping times) for which the ARL 
to false alarm does not fall below a given (desired and a priori set) level 7 > 1. Lorden's version of the 
minimax optimization problem is to find Topt G ^(7) such that J\,{Topt) = infrGA{7) J^hiT) for every 
7 > 1. For the iid model, this problem was solved by Moustakides (1986), who showed that the solution is 
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the CUSUM procedure. Specifically, if we select the detection threshold A = Ay from the solution of the 
equation ARL(Ca^) = 7, then Jl{Ca^) = infTGA(7) Jl{T) for every 7 > 1. 

Though CUSUM's strict ^L(7')-optimality is a strong result, ideal for engineering purposes would be to 
have a procedure that minimizes the average (conditional) detection delay, ADD^(T) = K,y[T — ulT > v], 
for all u ^ simultaneously. As no such uniformly optimal procedure is possible, PoUak (1985) suggested 
to revise Lorden's version of minimax optimality by replacing Jl{T) with J-p{T) = supo^,y<oo ADD,^(T), 
i.e., with the worst average (conditional) detection delay, and seek Topt G A (7) such that J^{Topt) = 
™fTeA{7) ^p(^) for every 7 > 1. We believe that this problem has higher applied potential than Lorden's. 
Contrary to the latter, an exact solution to PoUak's minimax problem is still an open question. See, e.g., Pol- 
lak (1985), Polunchenko and Tartakovsky (2010), Moustakides et al. (2011), and Tartakovsky et al. (2011) 
for a related study. 

Note that Lorden's and PoUak's versions of the minimax formulation both assume that the detection 
procedure is applied only once; the result is either a false alarm, or a correct (though delayed) detection, and 
no data sampling is done past the stopping point. This is known as the single-run paradigm. Yet another 
formulation emerges if one considers applying the same procedure in cycles, e.g., starting anew after every 
false alarm. This is the multi-cyclic formulation. 

Specifically, the idea is to assume that in exchange for the assurance that the change will be detected with 
maximal speed, the experimenter agrees to go through a "storm" of false alarms along the way. The false 
alarms are ensued from repeatedly applying the same detection rule, starting from scratch after each false 
alarm. Put otherwise, suppose the change-point, u, is substantially larger than the desired level of the ARL 
to false alarm, 7 > 1. That is, the change occurs in a distant future and it is preceded by a stationary flow 
of false alarms; the ARL to false alarm in this context is the mean time between consecutive false alarms. 
As argued by Pollak and Tartakovsky (2009), this comes in handy in many surveillance applications, in 
particular in the area of cybersecurity which will be addressed in Section 6. 

Formally, let Ti , r2 , . . . denote sequential independent applications of the same stopping time T, and let 
7(j) = r(i) + T(2) + • • • + T'q) be the time of the j-th alarm, j ^ 1. Let 1^ = min{j ^ 1 : 7(j) > y} 
so that 7(/^) is the point of detection of the true change, which occurs at time instant v after lu — I false 
alarms have been raised. Consider Js^viT) = \\my^ao'^u\T[i^) — the limiting value of the ADD 

that we will refer to as the stationary ADD (STADD); then the multi-cyclic optimization problem consists 
in finding Topt G A(7) such that i7sT(7opt) = infTgA(7) JsTiT) for every 7 > 1. For the continuos-time 
Brownian motion model, this formulation was first proposed by Shiryaev (1961, 1963) who showed that 
the SR procedure is strictly optimal. For the discrete-time iid model, optimality of the SR procedure in this 
setting was solved recently by Pollak and Tartakovsky (2009). Specifically, introduce the integral average 
detection delay 



and the relative integral average detection delay Jq^ (T) = lADD(T) / ARL(r). If the detection threshold, 
A, is set to Ay, the solution of the equation ARL(5^^) = 7, then Jgb{Sa^) = infreA(7) Jgb{T) for every 
7 > 1. Also Jgb{T) = Jst{T) for any stopping time T, so that Jst{Sa-,) = infrGA(7) Jst{T) for every 
7 > 1. 

We conclude this section reiterating that our goal is to test the above four procedures - CUSUM, SR, 
SRP and SR-r - against each other with respect to two measures of detection delay - J-p{T) and ^st(^)- 
We will also analyze the accuracy of the asymptotic approximations which are introduced in the next section. 
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3. ASYMPTOTIC OPTIMALITY AND PERFORMANCE APPROXIMATIONS 



To remind, we focus on Pollak's minimax formulation of minimizing the maximal ADD and on the multi- 
cyclic formulation of minimizing the stationary ADD. We also indicated that the latter is a solved problem 
(and the solution is the SR procedure), while the former is still an open question. The usual way around 
is to consider the asymptotic case 7 — )• oo. The hope is to design a procedure, T*^^ G ^(7)> such that 
•^p(^opt) ^^'^ (unknown) optimum, mireAi-y) J'p{T), will be in some sense "close" to each other in 
the Umit, as 7 — 00. To this end, three different types of J^{T*^^)-io-m.iT^^('-i) J'p{T) convergence are 
generally distinguished: a) A procedure T*^^^ G ^(7) is said to be first-order asymptotically ^p(T)-optimal 
in the class A(7) if Jv{T*^t) = [infTGA(7) Jp(7')][l + o(l)], as 7 00, where hereafter o(l) 0, 
as 7 — ;> 00, b) second-order - if Jv{T*^^) = infT£A(7) ^p(^) + 0{\), as 7 — ;> 00, where 0(1) stays 
bounded, as 7 — ;> 00, and c) third-order - if Jv{T*^^) = infj.g^(^) Jv{T) + o(l), as 7 — )• 00. Note that 
infTeA(7) J{T) — > 00 as 7 — ^ 00. 

We now review our four procedures' individual asymptotic optimality properties and provide the cor- 
responding asymptotic approximations for ARL(T) = Eoo[r], i7p(T) = supQ^^^^^ ADD^(r) (where 
ADD^(r) = E^[r - v\T > z/]), ADD^(r) = lim^^ooADD^(r), and Js^{T). Let Z., = logA^ 
and let If = — Eoo[Zi] and Ig = Eo[.Z'i] denote the KuUback-Leibler information numbers. Hence- 
forth, it will be assumed that Zi is P^- and Po-nonarithmetic and that < /j < 00 and < < 00. 
Throughout the rest of the paper we will also assume the second moment conditions Eqo | ^1 P < 00 and 
Eq \Zi\'^ < 00. Further, introduce the random walk {Sn]n^o, where 5.„ = Z^j=i Zi, n ^ 1, with = 0. 
Let Ko = Sjli and define Q{x) = Po(Ko ^ x). Next, for a ^ 0, introduce the one-sided stopping 
time Ta = inf{n ^ 1 : 5„ ^ a} and let = Sr^ — a denote the overshoot (i.e., the excess of 5„ over the 
level a at stopping). Define (" = lima_j.oo lEo[e~'^"] and x = lima_^oc Eo[Ka]. It can be shown that 



where hereafter x = — min(0, x); cf., e.g., Woodroofe (1982, Chapters 2 & 3) and Siegmund (1985, 
Chapter VIII). 

We are now poised to discuss optimality properties and approximations for the operating character- 
istics of the four detection procedures of interest. We begin with the CUSUM procedure. First, from 
the fact that Jl(Ca) = Jp(Ca) = ADDo(Ca) and the work of Lorden (1971) one can deduce that the 
CUSUM procedure is first-order asymptotically ^p(T) -optimal. However, in fact it is second-order J-p{T)- 
minimax which can be established as follows. Since CUSUM is strictly optimal in the sense of minimizing 
the essential supremum ADD Ji^{T), which is equal to ADDq for both CUSUM and SR, it is clear that 
ADDo(Ca) < ADDo(5a), where the thresholds are different for CUSUM and SR to attain the same ARL 
to false alarm 7. By Tartakovsky et al. (2011), the SR procedure is second-order asymptotically mini- 
max with respect to J-p{T), so that CUSUM is also second-order asymptotically ^p(T)-optimal. Now, let 
A = Ay, where A^ is the solution of the equation AKL^Ca^) = 7- Then 



where /3o = Eo[min„^o 'S'n]- We iterate that J^p{Ca) = ^o[Ca]- This asymptotic expansion was first 
obtained by Dragalin (1994) for the single-parameter exponential family, but it holds in a general case too as 
long as the second moment condition Eo[Z^] < 00 is satisfied and Zi is Po-non-arithmetic. See Tartakovsky 
(2005), who also shows that 




(3.1) 



Jp(CaJ = i-(logA^ + x + /3o) + 0(1), as 7^00, 



(3.2) 
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where = lim„_^oo [Sn — mino«;fcsgn 5'fc] ; constants /3q and 13^ can be computed numerically (e.g., by 
Monte Carlo simulations). An accurate approximation for the ARL to false alarm is as follows: 



A log A 1 



ARL(C^) « TT2 - - — (3.4) 



We now proceed to the SR procedure. First, recall that, by Pollak and Tartakovsky (2009), this procedure 
is a strictly optimal multi-cyclic procedure in the sense of minimizing the stationary ADD. The following 
lower bound for the minimal SADD is fundamental for the analysis carried out in Section 5: for every r ^ 0, 

rADDo[cS^ ] +IADD(5^ ) 

where A.^ is the solution of the equation ARL(5^ ) = 7, 7 > 1 (see Moustakides et al. 2011 and Pol- 
unchenko and Tartakovsky 2010). Taking r = in (3.5), we obtain infygA(7) ^p(^) ^ Js,'v{Sa-~i)- 
Let be a random variable that has the P^o-limiting (stationary) distribution of Rn as n — )• 00, i.e., 

Qst(2;) = lim„_^oo ^^{Rn <.x)= ¥^{R^ ^ x). Let V = ZT=i e""^' and Q{x) = Po(y ^ x). Let 

C^=E[\ogil + R^ + V)]= / logil + x + y)dQsTix)dQiy). (3.6) 

Jo Jo 

By Tartakovsky et al. (2011), 

.7st(5a^) = 7-(log + X- Coo) + 0(1), as 7-^cx), 

so that 

inf Jp(r) ^ — (log^^ + x-Coo) + o(l), as 7^00. (3.7) 

TGA(7) Ig 

On the other hand, a straightforward argument shows that, for the SR procedure, if A = A^ is the solution 
of the equation ARL(5a^) = 7, then ARL(5a^) = 7 ~ and 

Jp(5a,) = MSa,] = ^(log + X - Co) + 0(1), as 7 ^ 00, (3.8) 
^9 

where 



/•OD 

Co = E[log(l + V^)] = / log(l + x) dQ{x) 

Jo 



(cf. Tartakovsky et al. (2011)). Since Co < Coo, it follows that the SR procedure is second-order minimax, 
the fact that we have already mentioned above when considering CUSUM. The difference is (Coo — Co) /Ig, 
which can be quite large when detecting small changes. 

For an arbitrary scenario, the constant Co and distribution Q{x) are amenable to numerical treatment. 
For cases where both can be computed exactly in a closed form see Tartakovsky et al. (2011) and Pol- 
unchenko and Tartakovsky (2011). The approximation ARL{Ca) ~ A/( is known to be very accurate 
and is therefore recommended for practical use; it can be derived from the the fact that {Rn — n}n^Q is a 
zero-mean -martingale. 

We now consider the SRP procedure. From decision theory (see, e.g., Ferguson, 1967, Theorem 2.1 1.3) 
it is known that the minimax procedure should be an equalizer, i.e., ADD,y(T) should be the same for all 
1/ ^ 0. To make the SR procedure an equalizer, Pollak (1985) suggested to start the SR detection statistic, 
{Rn}n^o^ from a random point distributed according to the quasi-stationary distribution. However, Pollak 
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was able to demonstrate that the SRP procedure is only asymptotically third-order ^p(T)-optimal. More 
specifically, let Eo[^/'] < oo, and suppose that the detection threshold, A, is set to A^, the solution of 
the equation ARL(5^ ) = 7. Then Jp(cS^ ) = MT^Ai-y) Jp(.T) + o(l), as 7 — 00. Recently, Tar- 

takovsky et al. (2011) obtained an asymptotic approximation for J^p{S^) under the second moment condi- 
tion Eo[Zf] < 00: 

^P('5?J = ^(log^7 + >^-'^oo) + o(l), as 7^00, (3.9) 
-'5 

and we note that Jp(5^) = ADD„{S'^) for all 1/ ^ 0, and Jp(5^) = Jst(5^)- To approximate 
ARL(5^), Tartakovsky et al. (2011) suggest to use the formula ARL(5^) A/( — /ig, where fig = 
X dQA{x) is the mean of the quasi-stationary distribution. 

It is left to consider the SR-r procedure. Since this procedure is sensitive to the choice of the starting 
point, we first discuss the question of how to design this point. To this end, fundamental is the lower bound 
for the minimal SADD given in (3.5). From (3.5) one can deduce that if the starting point = r* is chosen 
so that the SR-r procedure is an equalizer (i.e., ADD^{S^ ) is the same for all v ^ 0), then it will be exactly 
^p(r)-optimal. Indeed, if the SR-r procedure is an equalizer, then i7lb('5^*) = Eo[5^*] = J^p{S^). This 
argument was exploited by Polunchenko and Tartakovsky (2010), who found change-point scenarios where 
ARL(5^), ADD,y(5^), u ^ 0, and r* can be obtained in a closed form and the SR— r procedure with 
r = r* is strictly optimal. In general, Moustakides et al. (2011) suggest to set the starting point to the 
solution of the following constraint minimization problem: 

r* = arginf{Jp(cS^)- Jlb(5^)} subject to ARL(cS^) = 7. (3.10) 

Os;r'<oo 

We discuss this approach in detail in Section 5 where we also discuss certain properties of r*, as well as 
another approach that allows one to obtain a neai^ly optimal result for the low false alarm rate. 

As shown by Tartakovsky et al. (2011), the SR-r procedure also enjoys the third-order asymptotic 
optimaUty property. Specifically, assume that the detection threshold A = Ajis the solution of the equation 
ARL(5^^) = 7 and the head start r (either fixed or growing at the rate of 0(7)) is selected in such a way 
that ARL(5^ ) A^/( and that the supremum ADD J^p{S^ ) is attained at infinity, i.e., J^p{S^ ) = 
ADD^(5;^J.\hen 

Jp(5^^)=ADD^(cS^^) = ^(log^^ + x-C^) + o(l), as 7^00. (3.11) 

-'9 

Comparing with the lower bound (3.7) we see that in this case the SR— r procedure is nearly optimal (within 
the negligible term o(l)). For practical purposes, to approximate ARL(5^) one may use the formula 
ARL(5^) « j4/C—r, which is rather accurate, and can be easily obtained from the fact that {iiJ^ — n—r}„^o 
is a zero-mean Poo-martingale. 

4. NUMERICAL TECHNIQUES FOR EVALUATION OF OPERATING CHARACTER- 
ISTICS 

In this section we present integral equations for operating characteristics of change detection procedures as 
well as outline numerical techniques for solving these equations that can be effectively used for performance 
evaluation. Further details can be found in Moustakides et al. (2011). 

Consider a generic detection procedure described by the stopping time 

n = inf{n ^l-.V^^A}, A> 0. (4.1) 
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where {V^^jn^o is a generic Markov detection statistic that follows the recursion 

= a^-i) A„, n > 1 with V^ = s^ 0. (4.2) 

Here ^{x) is a non-negative function and s is a fixed parameter, referred to as the starting point or the head 
start. 

The above generic stopping time describes a rather broad class of detection procedures; in particular, 
all considered procedures belong to this class. Indeed, for the CUSUM procedure £^{x) = max{l,a;} and 
for the SR-type procedures ^{x) = 1 + x. This universality of 7^ enables one to evaluate practically any 
Operating Characteristic (OC) of any procedure that is a special case of 7^, merely by picking the right 
^(x). We now provide a set of exact integral equations on all major OC-s for T^. 

Recall that A„ = g{Xn)/ f{Xn), n ^ 1, and assume that Ai is continuous. For d = {0,oo}, let 
P^{t) = Pd(Ai ^ t) denote the cdf of the LR. Let 



ICd{x,y) = ^F,(KVi ^ y\v^ = x) = ^Pj^ ) ' = ^0'°^} 



denote the transition probability density kernel for the Markov process {V^}n>i- Note that dP^{t) = 
tdP^{t), and therefore, )Co{x,y)/)C^{x,y) = y/S,{x), which can be used as a "shortcut" in deriving the 
formula for JCo{x, y) from that for /C^(x, y) or vice versa. 

Suppose that Vq = x is fixed. Let l{x) = 'E^\T^] and 5o{x) = Eo[T^']. It can be shown that l{x) and 
5q{x) satisfy the renewal equations 

pA f-A 
l{x) = l+ / lCUx.y)l[y)dy and 5o[x) = I + / K^^x^y) d^^y) dy, (4.3) 



JO JO 

respectively (cf. Moustakides et al. (2011)). Next, consider KD'Dy{TX) = ^u[T^ — ^\TX > v\ for an 
arbitrary fixed v ~^ \\ note that ADDo(7^) = (5o(x). To evaluate ADD,,(7^), Moustakides et al. (201 1) first 
argue that ^^iJl > v) = ^^{Tl > ly), and consequently, ADD^(rj^) = E^[{TX - u)+]/F^{TX > ly), 
so that we turn attention to 6^{x) = K^[{TX — z^)"*"] and Pu{x) = F^iTJ > i^)- It is direct to see that 

i-A i-A 
6u{x) = / K,^{x,y)5u^i{y)dy and Pu{x) = / IC^{x,y) pu^i{y) dy, u ^ 1, 
Jo Jo 

where 6o{x) is as in (4.3) and po{x) = 1, since F^{TX > 0) = 1; cf. Moustakides et al. (2011). As soon as 
and puix) are found, by the above argument ADD,y(T^) can be evaluated as the ratio 6y{x) / py{x). 
Furthermore, using ADD,y(7^)'s computed for sufficiently many successive z^'s beginning from v = and 
higher, one can also evaluate Jp(7^) = supo^,^<oo ADD^(7^), since ADD^(7^') = lim,^_j.oo ADD,^(7^) 
is independent of Vq = x. 

We now proceed to the stationary average detection delay J^st{T). It follows from Pollak and Tar- 
takovsky (2009) that the STADD is equal to the relative integral ADD: 

/ oo \ 



\ u=0 / 

andif weletV'(x) = ET=o^'^ii'^A - '')^] = El^o '^^(^)' ^en Jst(T|) = ip{x)/e{x). It can be shown 
that ■ip{x) satisfies 

^{x) = 5q{x)+ [ IC^{x,y)7P{y)dy, (4.4) 
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where 6q{x) is as in (4.3). 

The lADD is also involved in the lower bound J^lb{T) which is defined in (3.5) and is specific to 
the SR-r procedure, i.e., for the case when ^(x) = I + x. Assuming this choice of ^(x), observe that 
'Jhsi'S^) = ['fSo{r) + ■ip{r)]/[r + £{r)], which can be evaluated once £{x), 6o{x), and 'ip{x) are found. 

Consider now randomizing the starting point Vq = x in a fashion analogous that behind the SRP 
stopping time. Let Qa{x) = lim„_^oo IPoo(V^ ^ ^ITa > ^) quasi-stationary distribution; note that this 
distribution exists, as guaranteed by (Harris, 1963, Theorem III. 10.1). Let = inf{n ^ 1: Vn ^ A}, 
where {Vn}n^o is a randomized generic detection statistic such that = C(^„^i)^n for n ^ 1 with 

oc Qa- Note that is the SRP procedure when £,{x) = l + x. 

For Ta, any OC is dependent upon the quasi-stationary distribution. We therefore first state the equation 
that determines the quasi-stationary pdf qA{x) = dQA{x)/dx: 

^AqA{y)=l qA{x)JC^{x,y)dx, subject to / qA{x) dx = I (4.5) 
Jo Jo 

(cf. Moustakides et al. (2011) and Pollak (1985)). We note that qA{x) and Xa are both unique. Once 
qA{x) and are found, one can compute the ARL to false alarm I = K^[Ta] and the detection delay 
5 = Eo[T^], which is independent from the change-point: 

(•A _ 

i= l{x) qA{x) dx = 1 / {1 — Xa) and S= 6o{x) qAix) dx. 
Jo Jo 



The second equality in the above formula for £ is due to the fact that, by design, the -distribution of 
the discrete random variable Ta is exactly geometric with parameter 1 — Xa, < < 1, i.e., ¥^{Ta > 
u) = X\, V ^ 0; in general, lim^_j.oo Aa = 1- Also, Pollak and Tartakovsky (2011) provide sufficient 
conditions for Xa to be an increasing function of A; in particular, they show that if the cdf of Zi = log Ai 
under measure ]Poo(-) is concave, A^ is increasing in A. 

Equations (4.3)-(4.5) govern all of the desired performance measures (OC-s) for all considered detec- 
tion procedures. The final question is that of solving these equations to evaluate the OC-s. Usually these 
equation cannot be solved analytically, so that a numerical solver is in order. We will rely on one offered 
by Moustakides et al. (2011). Specifically, it is a piecewise-constant (zero-order polynomial) collocation 
method with the interval of integration [0, A] partitioned into ^ 1 equally long subintervals. The collo- 
cation nodes are the subintervals' middle points. As the simplest case of the piecewise collocation method 
(see, e.g., Atkinson and Han, 2009, Chapter 12), the question of accuracy is a well-understood one, and tight 
error bounds can be easily obtained from, e.g., (Atkinson and Han, 2009, Theorem 12.1.2). Specifically, it 
can be shown that the uniform norm of the difference between the exact solution and the approximate 
one is 0(1/A^), provided is sufficiently large. 

We are now in a position to apply the above performance evaluation methodology to the J\f{fi, a/i)-to- 
M{6, aO) change-point scenario, and compute and compare against one another the performance of the four 
detection procedures. 

5. PERFORMANCE ANALYSIS 

In this section, we utilize the performance evaluation methodology of the preceding section for the AA(/i, a/x)- 
to-J\f{9, a9) change-point scenario (1.1) and evaluate operating characteristics of four detection procedures. 
We first describe how we performed the computations and then report and discuss the obtained results. 
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We start with deriving pre- and post-change distributions P^{t) = F^^Ai ^ t), d = {0, 00} of the LR. 



It can be seen that 



A„ 



— exp 



A* 



2a 



exp 



and 



exp 



2a 



if 9 - fi> 0, and A„ < 



\ 2a(9/i 



■ exp 



9- fi 
2a 



, if 9- fi<0. 



Observe that to obtain the required distributions, one is to consider separately two cases 9 — fi > and 
9 — n < 0. Analytically, both cases can be treated in a similar manner, so that we present the details only in 
the first case. 

Let 9 — > 0. For any measure P(-) it is straightforward to see that 



\An < t) 



Xn ^ \ 9fl 



21ogt-log^)/(0-^f) + l 



Xn^-\l9fi (21ogt-log^)/(^-/x) + l 



f or t ^ W ^ exp 



9- n 
2a 



and P(A,i ^ t) = otherwise. 
Hence, 



9fi (21ogt-log^)/(0-^) + l 
- $ 



^ fi,a/j, 



9i^[(2logt-log^)/{9-fi) + l 



for t > 



■ exp 



9-fi 
2a 



and P^{t) = otherwise; hereafter, 



V27ro-2 



exp 



2a2 



is the cdf of a Gaussian distribution with mean fi and variance cr^. 
Similarly, 



Poit) = '^e,ae {\IOfi (21ogt - log^ ) /{9 - /i) + 1 



9fi (21ogt-log^)/(0-^) + l 



f or i ^ W ^ exp 



2a 



and P^{t) = otherwise. 
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As soon as both distributions, P^{t) and P^{t), are expressed in a closed form, one is ready to employ 
the numerical method of Moustakides et al. (2011). We implemented the method in MATLAB, and solved 
the corresponding equations with an accuracy of a fraction of a percent. 



It is also easily seen that the KuUback-Leibler information numbers If 
Eo[logAi] are 



If 



9y 1 
2a9 ^ 2 



Also, for this model, we have 



log 



and I„ 



^ , 1 
2aii 2 



■E^[logAi] and Ig 



log- 



Sk 



^logA 



X 



2fM 



E j _ ^ 
a9 



+ log- 



k>l. 



We now consider the question of computing x and (. To this end, in general both can be computed either 
using (3.1), or by Monte Carlo simulations. However, in our particular case, the latter is better since each Sk 
is non-central chi-squared distributed, and this distribution is an infinite series, which is slowly converging 
for weak changes, i.e., when 9 and fi are close to each other. This translates into a computational issue. 
Hence, x and ( as well as Co and are all evaluated by Monte Caiio simulations. 

Specifically, to compute x and C we generate 10^ trajectories of Sk for k ranging from 1 to 10^ for each 
measure Pd(-), d = {0, oo}, and estimate x and ( using (3.1). That is, we cut the infinite series appearing 
in both formulae at 10^. The same data are used to compute constants Pq and p^, as well as constants Co 
and Coo. We directly estimate the coiTcsponding integrals, which is a textbook Monte Carlo problem. 

One of the most important issues is the design of the SR-r procedure, namely, choosing the starting 
point r. As we discussed in Section 3, the head start Rq = r* should be set to the solution of the con- 
strained minimization problem (3.10). Clearly, with this initialization the SADD of the SR-r procedure is 
as close to the optimum as possible. The challenge is that unlike other procedures, one cannot determine 
the detection threshold from the equation ARL(5^) = 7 for the desired 7 > 1, even using the approxi- 
mation ARL(5^) (^/C) ~ f- The reason is that we now have two unknowns - A and r. This is where 
r* steps in. For a fixed 7, it is clear that as A grows, in order to maintain ARL(5^) at the same level 7 
(i.e., to keep the constraint ARL(5^) = 7 satisfied), r has to grow as well. The smallest value of r is 0, 
and when r = 0, we can easily determine the (smallest) value of A for which ARL(5^) = 7, because in 
this case it is the SR procedure, and ARL(5^) A/C- Once this A is found, we compute J'p{S^) and 
^lb('5^). and record the difference J^p{S^) — ■7lb('5^)- Then we increase the detection threshold, and find 
r for which ARL(5]^) = 7 is satisfied, and again compute J^p{S'J^) and J^LBi-S^A)' record the difference 
i7p(5^) — J^lb{-S'^)- This process continues until enough data is collected to figure out what A and r* are. 
See also Polunchenko and Tartakovsky (2011, Section 5). 

Though r* is an intuitively appealing initialization strategy, it is interesting only from a theoretical point 
of view. For practical purposes, Tartakovsky et al. (2011) suggest to choose r so as to equate ADDo(5^) 
and ADDoo(5^). This is the expected behavior of the SR-r procedure for large A's, and we in fact also do 
observe it for our example. For large values of A, ADDo(5^) can be approximated as 

1 



where 



ADDo(5^)«-(logA + x-C,), 
-'9 



C, = E[log(l +r + Vj]= / log(l + r + y) dQ{y), 

Jo 
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and and Q{y) are as in Section 3. We also recall that ADD^(5^) = J7p(5^) admits the same asymp- 
totics (3.11), except Cr is replaced with defined in (3.6). Hence, for large values of A, the equation 
ADDo(5^) = ADDoo(5^), where the unknown is r, reduces to the equation Cr = C^. Let be its 
solution. The important feature of compared to r* is that it does not depend of the threshold (i.e., on 7). 
Tartakovsky et al. (2011) and Polunchenko and Tartakovsky (2011) offer an example where Cr and C^o can 
be found in a closed form, and the equation Cr = C^ can be written out and solved explicitly. 

As a first illustration, suppose fi = 1000, 6 = 1001, and a = 0.01, and set 7 = 10^ for each procedure in 
question. The corresponding detection threshold and head start r* for each procedure are reported in Table 1 . 
For this case we have « 5 x 10"^, « 5 x 10~2, ( ^ 0.83145, x « 0.22, Co « 3.59, C^ ^ 4.5, 
/3o ~ —1, and 1.3. We see that for the SR procedure the asymptotics ARL(5^) « A/C is again 

perfect. The accuracy is great for the SRP rule as well: the asymptotics ARL(5^) A/( — /ig yields 
a value of 9999.5 average observations, while the computed value is 9999.845. For the SR-r procedure, 
the asymptotics ARL(5^) « A/( — r yields a value of 9999.57 average observations, while the computed 
value is 9999.875. For the CUSUM procedure, the approximation for the ARL (3.4) gives a value of 10006, 
while the computed one is 10001.22. 

Table 1. Thresholds, head start values and the ARL to false alarm for the four detection procedures for 
fi = 1000, e = 1001, a = 0.01, and 7 = 10^. ARL(r) is the actual ARL to false alarm exhibited 
by the procedure for the selected detection threshold and head start evaluated numerically 



Procedure 


A 


Head Start 


ARL(r) = E^[T] 


CUSUM 


350.75 


any VFq ^ 1 


10001.223 


SR 


8314.4 


Ro = 


10000.188 


SRP 


8392.0 


R^^Qa (^q~ 93.699) 


9999.845 


SR-r 


8356.0 


R'q w 50.345 


9999.875 



While not presented in this table, we now discuss the accuracy of the asymptotic approximations given 
in Section 3. The asymptotics for J^p{Sa) given by (3.8) yields a value of 117, which aligns well with the 
observed 113. For the CUSUM procedure we obtain that the approximation for Jp(Ca) (see (3.2)) gives 
101 vs. 105 observed. For ADD^{Ca) given in (3.3) we have 93 vs. 95 observed. For the approximations 
Jp{S^), Jp(5^) and ADD^(cS^) (see (3.9) and (3.11)) we have 93 vs. 95 observed. We can conclude that 
the accuracy of asymptotic approximations is satisfactory. 

Next, the ADD,y(T) = K^lT — v\T > u] vs. the changepoint v is shown in Figure 1. We first discuss 
the procedures' performance for changes that take place either immediately (i.e., v = 0), or in the near 
to mid-term future. As we have mentioned earlier, for the CUSUM and SR procedures the worst ADD is 
attained at 1/ = 0, i.e., J'p{Ca) = ^o[Ca\ and J^p{Sa) = ^oI-Sa]- From Figure 1 we gather that at = 
the average delay of the CUSUM procedure is 105 observations and of the SR procedure is 113. Overall, 
we see that the CUSUM procedure is superior to the SR procedure for ^ ^ 60; the two procedures are 
at performance parity at = 60. One can therefore conclude that the CUSUM procedure is better than the 
SR rule for changes that occur either immediately, or in the near to mid-term future. However, for distant 
changes (i.e., when u is large), it is the opposite, as expected. Below we will consider one more case where 
the SR procedure significantly outperforms the CUSUM procedure. For now note that the only procedure 
indifferent to how soon or late the change may take place is the SRP procedure S^; recall that it is an 
equalizer, i.e., ADDi,(5^) is the same for all z/ ^ 0. The (flat) ADD for the SRP procedure is 94. Since 
7 = 10'^, SRP and SR-r are nearly indistinguishable. Yet SR— r appears to be uniformly (though slightly) 
better than SRP. 

The obtained ADD^{T) for each procedure for selected values of u are summarized in Table 2. The table 
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also reports the lower bound (3.5) and the STADD Jgx(T). From the table we see that for the considered 
parameters all procedures deliver about the same STADD. 



Table 2. Summary of the performance of the four procedures for fi = 1000, 9 = 1001, a = 0.01, and 
7 = 10^ 



Procedure 


ADD^(r) = E^[T - iy\T > v\ 







50 


100 


150 


200 


:^st(t) 


CUSUM 


104.98 
(SADD) 


96.72 


95.75 


95.57 


95.53 


95.55 


SR 


112.87 
(SADD) 


97.26 


94.75 


94.15 


94.00 


94.00 


SRP 


94.127 (SADD) 


SR-r 


93.38 


94.04 


94.04 


94.04 


94.04 
(SADD) 


94.04 


Lower Bound 


94.04 



As another illustration, suppose /^t = 1000, = 1001, but a = 1. Also, assume that 7 = lO'^. The 
con^esponding detection threshold and head start for each procedure are reported in Table 3. For this case 
we have If ^ 5 x IQ-^, Ig ^ 5 x 10'^, ( ~ 0.981, x ^ 1.55, Co ^ 8.08 and w 8.82, po « -2.1 
and /5oo 2.2. We see that for the SR procedure the asymptotics ARL{Sa) ~ ^/C is perfect. However, 
the accuracy is not as great for the SRP rule: the asymptotics ARL(5^) « A/( — fiq yields a value of 
929.72, while the computed value is 1000.33. For the SR-r procedure, the accuracy is slightly off as well: 
the asymptotics ARL{S'^) ^ A/C - r yields 930.72, while the computed one is 999.98. For the CUSUM 
procedure, approximation (3.4) gives 1358.3, while the computed one is 1000.1. 
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Table 3. Thresholds, head start values and the ARL to false alaim for the four detection procedures for 
= 1000, e = 1001, a = 1, and 7 = 10^. ARL(T) is the actual ARL to false alarm exhibited by 
the procedure for the selected detection threshold and head start evaluated numerically 



Procedure 


A 


Head Start 


ARL(r) = E^[r] 


CUSUM 


2.272 


any M^o =^ 1 


1000.096 


SR 


981.0 


Ro = 


999.996 


SRP 


1844.0 


R^^xQa (//q f« 879.248) 


1000.333 


SR-r 


1811.0 


Rl = r* K 845.872 


999.981 



The asymptotics for J-p{Sa) yields a value of 717, which aligns well with the observed 722. For the 
CUSUM procedure we obtain that the approximation for Jy>{Ca) gives 541 vs. 563. For ADDoo(C^) we 
have 314 vs. 463. For J-p{S^), ^p(5^) and AT)Y)^{S\) (which are asymptotically all the same) we have 
500 vs. 502. We see that the accuracy is reasonable. 
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Figure 2. Average detection delay ADD;^(T) = 'Ey\T — v\T > u] for the four detection procedures for 
Ai = 1000, e = 1001, a = 1, and 7 = 10^. 

Figure 2 shows the plots of ADDi^(r) = Ei,[T — i/|T > u] vs. the changepoint v for all four procedures 
of interest. At = the CUSUM procedure requires 563 observations (on average) to detect this change, 
while the SR procedure 722 observations. Overall, we see that the CUSUM procedure is superior to the 
SR procedure for ^ ^ 270; the two procedures are at performance parity at z/ = 270. Again, we can 
conclude that the CUSUM procedure is better than the SR rule for changes that occur either immediately, 
or in the near to mid-term future. However, for u ^ 1000, the SR procedure considerably outperforms the 
CUSUM procedure: the former's ADD is 263, while the letter's is 463. Hence, the SR procedure is better for 
detecting distant changes than the CUSUM procedure, as expected. The (flat) ADD for the SRP procedure 
is approximately 503. This is twice as worse as the SR procedure for u ^ 1500, whose detection lag for 
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such lai^ge i/'s is 263, and stays steady for lai^ger z^'s. 

Regarding the performance of the SR-r procedure compared to that of the other three procedures, one 
immediate observation is that SR-r is uniformly better than the SRP rule; although the difference is only 
8-^9 samples, i.e., a mere fraction of a percent compared to the magnitude of the ADD itself. A similar 
conclusion was previously reached by Moustakides et al. (2011) for the Gaussian-to-Gaussian model and 
for the exponential-to-exponential model. Overall, out of the four procedures in question, the best in the 
minimax sense is the SR-r procedure. However, it is only slightly better that the SRP procedure since both 
are nearly asymptotical optimal in the minimax sense. 

Yet another observation one may make from Figure 2 is that the magnitude of the difference between 
ADDo(T) and ADD2ooo(r) for the SR procedure is extremely large, and the fact that the convergence to 
the steady state ADD is very slow. This can be explained by the fact that for this case the SR detection 
statistic is "singular": Rn as a function of n is a diagonal line with unit slope since i?„ deviates very little 
from its mean value n for such small changes. 

The obtained ADDj/(T) for each procedure for selected values of u are summarized in Table 4. The 
table also reports the lower bound (3.5) and the STADD J^st:{T). We remind that the SR procedure is strictly 
optimal in the sense of minimizing the STADD. From the table we see that its STADD is 396.44 average 
observations. Since the SRP rule, 5^, is an equalizer, i7p(5^) = J^st{S^)'> for this particular case, we see 
that the SRP rule is as efficient as 502.64 average observations in terms of its STADD. The other two rivals - 
the CUSUM procedure and the SR-r procedure - are almost equally efficient: 471.67 average observations 
for the former, and 477.43 average observations for the latter. All in all, the SRP rule is the worst in the 
multi-cyclic sense. 



Table 4. Summary of the performance of the four procedures for /i = 1000, 9 = 1001, a = 1, and 7 = 10^. 



Procedure 


ADD^(r) = K^[T - iy\T > u] 
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CUSUM 


563.26 
(SADD) 


495.06 


467.31 


463.29 


463.15 


463.15 


463.15 


All. 61 


SR 


722.36 
(SADD) 


626.20 


498.64 


339.18 


268.14 


263.27 


262.91 


396.44 


SRP 


502.636 (SADD) 


SR-r 


495.10 
(SADD) 


454.29 


454.39 


473.65 


489.82 


493.22 


493.89 


477.56 


Lower Bound 


485.60 



6. AN APPLICATION TO CYBERSECURITY 

In this section, we apply change-point detection theory in cybersecurity - for rapid anomaly detection in 
computer networks' traffic. A volume-type traffic anomaly is an event identified with a change in traffic's 
volume characteristics, e.g., in the traffic's intensity defined as the packet rate (i.e., number of network pack- 
ets transmitted through the link per time unit). For example, a change in the packet rate may be caused by an 
inti'usion attempt, a power outage, or a misconfiguration in the network equipment; whether unintentionally 
or not, such events occur daily and are of harm to the victim. One way to mitigate the harm is by devising an 
automated anomaly detection system to rapidly catch suspicious, spontaneous changes in the traffic flow, so 



17 



an appropriate response can be provided in a timely manner. To this end, such an anomaly detection system 
can be designed, e.g., by using change-point detection. 

This is not a new idea; see, e.g., Blazek et al. (2001); Tartakovsky et al. (2006a,b). It is typically a 
modification of either the CUSUM procedure or Wald's Sequential Probability Ratio Test (SPRT). The SR 
procedure has never been paid much attention in this application. However, recall that it is exactly optimal in 
the multi-cyclic sense for detecting changes that occur in a distant future. We believe that this formulation is 
appropriate for the problem of anomaly detection in computer networks, and hence, the SR procedure may 
be a better option than CUSUM. We will show that this is indeed the case. 

First, we note that in order to employ a detection procedure it is desirable to know the pre- and post- 
change models. As we focus on volume-type anomalies, the observations in our case are instantaneous 
values of the packet rate (traffic intensity). It is currently believed that the behavior of the packet rate can be 
well modeled by a Poisson process. The main reason is the tremendous growth of the Internet backbone in 
recent years. See, e.g., Cao et al. (2002), Kai^agiannis et al. (2004), and Vishwanathy et al. (2009). Using real 
data, we will show that the A/'(/x, afi)-to-J\f{6, aO) model (which is the standard continuous approximation 
to the discrete Poisson process when a = 1) is appropriate to characterize the traffic intensity. Moreover, 
this model is more general than the Poisson model since by controlling parameter a one can change the 
effect of a shift in the mean on the variance. 

We now present the results of testing of SR and CUSUM for a real Distributed Denial-of-Service (DDoS) 
attack, namely, an Internet Control Message Protocol (ICMP) reflector attack. The essence of this kind of 
attacks is to congest the victim's link with echo reply (ping) requests sent by a large number of separate 
compromised machines (reflectors) so as to have the victim's machine exhaust all of its resources handling 
the ping requests and ultimately crush. This kind of an attack clearly creates a volume-type of anomaly in 
the victim's traffic flow. 

The ICMP data trace is courtesy of the Los Angeles Network Data Exchange and Repository (LAN- 
DER) project (see http://www.isi.edu/ant/lander). This is a reseaixh-oriented traffic capture, storage and 
analysis infrastructure that operates under Los Nettos, a regional Internet Service Provider (ISP) in the Los 
Angeles area. The aggregate traffic volume through Los Nettos measures over 1 Gigabit/s each way, and the 
ISP's backbone is 10 Gigabit. Leveraged by a Global Positioning System (GPS) clock, LANDER's capture 
equipment is able to collect data at line speed and with a down-to-the-nanosecond time precision. 

The attack starts at roughly 102 seconds into the trace and lasts for about 240 seconds. According to the 
data provider, the number of reflectors is 143, which gives a general idea as to the attack's intensity. The 
exact intensity can be gathered from Figure 3, which depicts the corresponding traffic volume characteristics. 
Specifically, Figure 3(a) shows the packet rate as a function of time. The sampling rate is 0.5 seconds. The 
total number of samples in the trace is 879. It can be seen that the packet rate explodes at the attack's starting 
point (102 seconds into the trace), and goes back to the original (pre-attack) level at the attack's ending point 
(roughly 348 seconds into the trace). Figure 3(b) shows the bit rate as a function of time. The bit rate is 
defined as the number of bits of information transmitted through the link per time unit. This is yet another 
volume characteristic. Apart from the spike at approximately 148 seconds into the trace, its behavior is 
relatively calm thr^oughout the trace. This is consistent with the fact that this is an ICMP reflector attack, 
i.e., a flooding-type of attack consisting of ping requests, which are small in size. That is, during the attack 
the traffic flow is very packet-dense, but the aggregate amount of information carried by the packets stays 
about the same as it was prior to the attack. 

We estimated the packet rate's average and variance for legitimate traffic and for attack traffic too; in 
both cases, to estimate the average we used the usual sample mean, and to estimate the variance we used 
the usual sample variance. For legitimate traffic we obtained that the average packet rate is about 13329.764 
packets per second, and the variance is in the neighborhood of 266972.736 packets per second; the aggregate 
number of samples is 401. For attack traffic the numbers are 17723.833 and 407968.14, respectively, while 
the number of samples is 478. We can now see the effect of the attack: it leads to an increase in the packet 
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(a) Packet rate. 



(b) Bit rate. 



Figure 3. ICMP reflector attack: traffic volume characteristics. 

rate's average and variance by approximately 50% each. Also, note that both averages are rather high. This 
justifies our large values for /x and 9 chosen in the preceding section. This also justifies the point made in 
the introduction that if the packet rate truly were Poisson, with such high average it would be practically 
indistinguishable from Gaussian with cr^ = /i. 

The fact that neither for legitimate traffic nor for attack traffic the variance is not the same as the mean 
suggests to revise the conventional Poisson model. A Gaussian distribution may be a better alternative. To 
see how well the data agree with this hypothesis, we now access the goodness of fit of a Gaussian distribution 
and the data. Figure 4 shows the empirical densities of the packet rate for legitimate and attack traffic. A 
high level of resemblance of both curves with the Gaussian distribution is apparent. 
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(a) Legitimate (pre-attack) traffic. 
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Figure 4. ICMP reflector attack: packet rate pdf with a Gaussian fit. 



The same conclusion can be made from an inspection of the corresponding Q-Q plots (quantile-quantile) 
shown in Figure 5. Specifically, the Q-Q plot for legitimate traffic is shown in Figure 5(a), and one for attack 
traffic in Figure 5(b). Both plots are for centered and scaled data, so the fitted Gaussian distribution is the 
standard normal distribution. The fact that either plot is a straight line also confirms the "Gaussianness" of 
the data distribution. 
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(a) Legitimate (pre-attack) traffic. (b) Attack traffic. 

Figure 5. ICMP reflector attack: Q-Q plots for packet rate distribution vs. Gaussian distribution. 



In summary, we conclude that distributions of the packet rate are Gaussian for both legitimate and attack 
traffic. For legitimate traffic the variance-to-mean ratio is about 20.028 and for attack traffic it is 23.018. 
This is evidence that the A/'(/x, afi)-to-J\f{6, aO) model with a = 20 23 is appropriate to use for describing 
the behavior of the traffic flow in the trace. 

We now proceed to the performance analysis. Since the ICMP attack is extremely contrast, we expect 
that any reasonable detection procedure will detect it in no time. As a result, it will be difficult to judge 
whether SR is better than CUSUM or not. This can be overcome by "manually" lowering the intensity of 
the attack, e.g., by applying the following transformation to values of the packet rate for attack's traffic 



Xi = V13600 X 20.028 x 



Xi - 17723.833 
V407968.14 



■ + 13600, 



where Xi is the original value of the attack's packet rate at the i-th sample. This makes the data in the trace 
behave according to Iht M{^i,an)-io-M {9, aO) model with = 13329.764, 9 = 13600 and a = 20.028. 
The result is shown in Figure 6. We see that the attack is not as contrast any more. 

The next step is to make sure that the ARL to false alarm for each procedure is at the same desired level 
7 > 1. To accomplish this, we recall that asymptotically, for 7 sufficiently lai^ge, M{L{Ca) ~ ^/(^gC^) ~ 
\og A/ If - l/(/gC) and ARL(5a) ^ A/C, in our case, Ig ^ 0.1369, // w 0.1342 and C ~ 0.7313. 
Thus, for, e.g., 7 = 1000 (which is moderate from a practical point of view), the threshold for the CUSUM 
procedure should be set to 76.32, and that for the SR procedure to 731.3. We confirmed both thresholds 
numerically using the methodology of Section 4: the actual ARL to false alarm for the CUSUM procedure is 
998.4, and that for the SR procedure is 1000.1. We can now employ both procedures to detect the diminished 
version of the attack. The result is shown in Figure 7. Specifically, Figure 7(a) shows the behavior of the SR 
detection statistic for the first 120 seconds of the trace's data (recall that the attack starts 102 seconds into 
the trace), and Figure 7(b) shows the same for the CUSUM detection statistic. 

We see that both procedures successfully detect the attack, though at the expense of raising three false 
alarms along the way. The detection delay for the SR procedure is roughly 15 seconds (or 30 samples), and 
that for the CUSUM procedure is about 18 seconds (or 36 samples). Thus, the SR procedure is better. Yet 
one may argue that the SR procedure is only slightly better. The difference is small simply because the trace 
is too short: if the attack took place at a point farther into the trace, and we had enough "room" to repeat 
the SR procedure at least a few times, the detection delay would be much smaller than that of the CUSUM 
procedure. Since in real life legitimate traffic dominates, the multi-cyclic idea is a natural fit. 
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Figure 6. ICMP reflector attack: packet rate with attack's intensity diminished. 
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(a) By the Shiryaev-Roberts (SR) procedure. (b) By the CUSUM procedure. 

Figure 7. ICMP (diminished) reflector attack: detection. 



7. CONCLUSION 

We considered the basic iid version of the change-point detection problem for the Gaussian model with 
mean fi > and variance connected via cr^ = afi with a > 0, a known constant. Of the two degrees of 
freedom - fi and a - the change is only in the mean (from one known value to another), though the variance 
is affected as well. For this scenario, we accomplished two objectives. 

First, we carried out a peer-to-peer multiple-measure comparative performance analysis of four detec- 
tion procedures: CUSUM, SR, SRP and SR-r. We benchmarked the performance of the procedures via 
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PoUak's maximal average delay to detection and Shiryaev's multi-cyclic stationary average delay to detec- 
tion, subject to a tolerable lower bound, 7, on the ARL to false alarm. We tested the asymptotic (as 7 — )• 00) 
approximations for operating characteristics using numerical techniques (solving Fredholm-type integral 
equations numerically). The overall conclusion is that in practically interesting cases the approximations' 
accuracy is "reasonable" or better. 

Second, we considered an application of change-point detection to cybersecurity, specifically to the 
problem of rapid anomaly detection in computer networks' traffic. Using real data we first showed that 
network ti'affic's statistical behavior can be well-described by the Gaussian model with = a//; due to 
the fact that the effect of a shift in the mean on the variance can be controlled through parameter a, the 
Gaussian model is more flexible than the traditional Poisson one. We then successively employed the SR 
and CUSUM procedures to detect an ICMP reflector attack. The SR procedure was quicker in detecting the 
attack, which can be explained by its exact multi-cyclic optimality when detecting changes occurring in a 
far time horizon. Based on this we recommend the SR procedure for the purposes of anomaly detection in 
computer networks. 
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